Eigen-Distortions of Hierarchical Representations
Abstract
We develop a method for comparing hierarchical image representations in terms of their ability to explain perceptual sensitivity in humans. Specifically, we utilize Fisher information to establish a model-derived prediction of sensitivity to local perturbations of an image. For a given image, we compute the eigenvectors of the Fisher information matrix with largest and smallest eigenvalues, corresponding to the model-predicted most- and least-noticeable image distortions, respectively. For human subjects, we then measure the amount of each distortion that can be reliably detected when added to the image, and compare these thresholds to the predictions of the corresponding model. We use this method to test the ability of a variety of representations to mimic human perceptual sensitivity. We find that the early layers of VGG16, a deep neural network optimized for object recognition, provide a better match to human perception than later layers, and a better match than a 4-stage convolutional neural network (CNN) trained on a database of human ratings of distorted image quality. On the other hand, we find that simple models of early visual processing, incorporating one or more stages of local gain control, trained on the same database of distortion ratings, provide substantially better predictions of human sensitivity than both the CNN and all layers of VGG16.

Human capabilities for recognizing complex visual patterns are believed to arise through a cascade of transformations, implemented by neurons in successive stages in the visual system. Several recent studies have suggested that representations of deep convolutional neural networks trained for object recognition can predict activity in areas of the primate ventral visual stream better than models constructed explicitly for that purpose (Yamins et al. [2014], Khaligh-Razavi and Kriegeskorte [2014]).
These results have inspired exploration of deep networks trained on object recognition as models of human perception, explicitly employing their representations as perceptual metrics or loss functions (Hénaff and Simoncelli [2016], Johnson et al. [2016], Dosovitskiy and Brox [2016]). On the other hand, several other studies have used synthesis techniques to generate images that indicate a profound mismatch between the sensitivity of these networks and that of human observers. Specifically, Szegedy et al. [2013] constructed image distortions, imperceptible to humans, that cause their networks to grossly misclassify objects. Similarly, Nguyen and Clune [2015] optimized randomly initialized images to achieve reliable recognition from a network, but found that the resulting ‘fooling images’ were uninterpretable by human viewers. Simpler networks, designed for texture classification and constrained to mimic the early visual system, do not exhibit such failures (Portilla and Simoncelli [2000]). These results have prompted efforts to understand why generalization failures of this type are so consistent across deep network architectures, and to develop more robust training methods to defend networks against attacks designed to exploit these weaknesses (Goodfellow et al. [2014]). From the perspective of modeling human perception, these synthesis failures suggest that representational spaces within deep neural networks deviate significantly from those of humans, and that methods for comparing representational similarity, based on fixed object classes and discrete sampling of the representational space, may be insufficient to expose these failures. If we are going to use such networks as models for human perception, we need better methods of comparing model representations to human vision.

∗Currently at Google, Inc.
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
Recent work has analyzed deep networks’ robustness to visual distortions on classification tasks, as well as the similarity of classification errors that humans and deep networks make in the presence of the same kind of distortion (Dodge and Karam [2017]). Here, we aim to accomplish something in the same spirit, but rather than testing on a set of hand-selected examples, we develop a model-constrained synthesis method for generating targeted test stimuli that can be used to compare the layer-wise representational sensitivity of a model to human perceptual sensitivity. Utilizing Fisher information, we isolate the model-predicted most and least noticeable changes to an image. We test the quality of these predictions by determining how well human observers can discriminate these same changes. We test the power of this method on six layers of VGG16 (Simonyan and Zisserman [2015]), a deep convolutional neural network (CNN) trained to classify objects. We also compare these results to those derived from models explicitly trained to predict human sensitivity to image distortions, including a 4-stage generic CNN, a fine-tuned version of VGG16, and a family of highly-structured models explicitly constructed to mimic the physiology of the early human visual system. Example images from the paper, as well as additional examples, can be found online at http://www.cns.nyu.edu/~lcv/eigendistortions/.

1 Predicting discrimination thresholds

Suppose we have a model for human visual representation, defined by the conditional density $p(\vec{r} \mid \vec{x})$, where $\vec{x}$ is an $N$-dimensional vector containing the image pixels, and $\vec{r}$ is an $M$-dimensional random vector representing responses internal to the visual system. If the image is modified by the addition of a distortion vector, $\vec{x} + \alpha \hat{u}$, where $\hat{u}$ is a unit vector and the scalar $\alpha$ controls the amplitude of distortion, the model can be used to predict the threshold at which the distorted image can be reliably distinguished from the original image.
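This discrimination setup can be simulated directly. The sketch below assumes, for illustration only, a toy response model $f$ with additive white Gaussian noise of amplitude $\sigma$ in the response space (the noise model adopted later in this section); the mapping, image, and all variable names here are hypothetical stand-ins, not the models tested in the paper. Under that assumption, the discriminability of $\vec{x} + \alpha\hat{u}$ from $\vec{x}$ is $d' = \|f(\vec{x} + \alpha\hat{u}) - f(\vec{x})\| / \sigma$, which grows with the distortion amplitude $\alpha$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy response model f and "image" x (16 pixels); sigma is the
# amplitude of the additive white Gaussian response noise. All choices here
# are illustrative, not from the paper.
N = 16
A = rng.standard_normal((N, N)) / np.sqrt(N)
f = lambda x: np.tanh(A @ x)
x = rng.standard_normal(N)
sigma = 0.1

u = rng.standard_normal(N)
u /= np.linalg.norm(u)          # unit distortion direction u-hat

# For responses r ~ N(f(x), sigma^2 I), discriminability of x + alpha*u from x
# is d' = ||f(x + alpha*u) - f(x)|| / sigma, increasing with alpha.
alphas = [0.01, 0.1, 1.0]
dprimes = [np.linalg.norm(f(x + a * u) - f(x)) / sigma for a in alphas]
print(dprimes)                  # monotonically increasing with alpha
```

The threshold in direction $\hat{u}$ is then the smallest $\alpha$ at which $d'$ reaches a criterion level set by experimental conditions.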
Specifically, one can express a lower bound on the discrimination threshold in direction $\hat{u}$ for any observer or model that bases its judgments on $\vec{r}$ (Seriès et al. [2009]):

$$T(\hat{u}; \vec{x}) \geq \beta \sqrt{\hat{u}^T J^{-1}[\vec{x}]\, \hat{u}} \tag{1}$$

where $\beta$ is a scale factor that depends on the noise amplitude of the internal representation (as well as experimental conditions, when measuring discrimination thresholds of human observers), and $J[\vec{x}]$ is the Fisher information matrix (FIM; Fisher [1925]), a second-order expansion of the log likelihood:

$$J[\vec{x}] = \mathbb{E}_{\vec{r} \mid \vec{x}}\!\left[\left(\frac{\partial}{\partial \vec{x}} \log p(\vec{r} \mid \vec{x})\right)\left(\frac{\partial}{\partial \vec{x}} \log p(\vec{r} \mid \vec{x})\right)^{T}\right] \tag{2}$$

Here, we restrict ourselves to models that can be expressed as a deterministic (and differentiable) mapping from the input pixels to a mean output response vector, $f(\vec{x})$, with additive white Gaussian noise in the response space. The log likelihood in this case reduces to a quadratic form:

$$\log p(\vec{r} \mid \vec{x}) = -\tfrac{1}{2}\,[\vec{r} - f(\vec{x})]^T [\vec{r} - f(\vec{x})] + \text{const.}$$

Substituting this into Eq. (2) gives:

$$J[\vec{x}] = \frac{\partial f}{\partial \vec{x}}^{T} \frac{\partial f}{\partial \vec{x}}$$

Thus, for these models, the Fisher information matrix induces a locally adaptive Euclidean metric on the space of images, as specified by the Jacobian matrix, $\partial f / \partial \vec{x}$.
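The computation above can be sketched end to end for a small differentiable model: form the Jacobian of $f$ at $\vec{x}$, build $J[\vec{x}]$ as its Gram matrix, and take the eigenvectors with largest and smallest eigenvalues as the predicted most- and least-noticeable distortions. The toy model, dimensions, and $\beta$ below are hypothetical illustrations (the paper applies this to VGG16 layers and physiologically structured models, not to this toy map):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy differentiable response model f(x) = tanh(W x), a stand-in for a
# network layer. W and the 16-pixel "image" x are hypothetical.
N, M = 16, 32                   # pixel and response dimensionalities
W = rng.standard_normal((M, N)) / np.sqrt(N)
x = rng.standard_normal(N)

# Analytic Jacobian of f at x: diag(1 - tanh^2(W x)) @ W, shape (M, N)
pre = W @ x
jac = (1.0 - np.tanh(pre) ** 2)[:, None] * W

# Fisher information matrix for the additive white Gaussian noise model:
# J[x] = (df/dx)^T (df/dx)
fim = jac.T @ jac

# Eigen-distortions: eigenvectors with smallest / largest eigenvalues
eigvals, eigvecs = np.linalg.eigh(fim)   # eigenvalues in ascending order
u_least = eigvecs[:, 0]    # model-predicted least-noticeable direction
u_most = eigvecs[:, -1]    # model-predicted most-noticeable direction

# Threshold bound T(u; x) = beta * sqrt(u^T J^{-1}[x] u); for an eigenvector
# with eigenvalue lam, u^T J^{-1} u = 1/lam, so T = beta / sqrt(lam).
beta = 1.0
T_most = beta / np.sqrt(eigvals[-1])
T_least = beta / np.sqrt(eigvals[0])
print(T_most < T_least)    # most-noticeable direction needs less amplitude
```

For large images and deep models the FIM is never formed explicitly; the extremal eigenvectors can instead be found with power iteration using Jacobian-vector and vector-Jacobian products.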